speaker encoder
Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis
We describe a neural network-based system for text-to-speech (TTS) synthesis that is able to generate speech audio in the voice of many different speakers, including those unseen during training. Our system consists of three independently trained components: (1) a speaker encoder network, trained on a speaker verification task using an independent dataset of noisy speech from thousands of speakers without transcripts, to generate a fixed-dimensional embedding vector from seconds of reference speech from a target speaker; (2) a sequence-to-sequence synthesis network based on Tacotron 2, which generates a mel spectrogram from text, conditioned on the speaker embedding; (3) an auto-regressive WaveNet-based vocoder that converts the mel spectrogram into a sequence of time domain waveform samples. We demonstrate that the proposed model is able to transfer the knowledge of speaker variability learned by the discriminatively-trained speaker encoder to the new task, and is able to synthesize natural speech from speakers that were not seen during training. We quantify the importance of training the speaker encoder on a large and diverse speaker set in order to obtain the best generalization performance. Finally, we show that randomly sampled speaker embeddings can be used to synthesize speech in the voice of novel speakers dissimilar from those used in training, indicating that the model has learned a high quality speaker representation.
- North America > United States (0.04)
- North America > Canada > Quebec > Montreal (0.04)
- North America > United States > California > Santa Clara County > Sunnyvale (0.04)
- North America > Canada > Quebec > Montreal (0.04)
- Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)
- Asia (0.04)
Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis
We describe a neural network-based system for text-to-speech (TTS) synthesis that is able to generate speech audio in the voice of many different speakers, including those unseen during training. Our system consists of three independently trained components: (1) a speaker encoder network, trained on a speaker verification task using an independent dataset of noisy speech from thousands of speakers without transcripts, to generate a fixed-dimensional embedding vector from seconds of reference speech from a target speaker; (2) a sequence-to-sequence synthesis network based on Tacotron 2, which generates a mel spectrogram from text, conditioned on the speaker embedding; (3) an auto-regressive WaveNet-based vocoder that converts the mel spectrogram into a sequence of time domain waveform samples. We demonstrate that the proposed model is able to transfer the knowledge of speaker variability learned by the discriminatively-trained speaker encoder to the new task, and is able to synthesize natural speech from speakers that were not seen during training. We quantify the importance of training the speaker encoder on a large and diverse speaker set in order to obtain the best generalization performance. Finally, we show that randomly sampled speaker embeddings can be used to synthesize speech in the voice of novel speakers dissimilar from those used in training, indicating that the model has learned a high quality speaker representation.
- North America > United States (0.04)
- North America > Canada > Quebec > Montreal (0.04)
- North America > United States > California > Santa Clara County > Sunnyvale (0.04)
- North America > Canada > Quebec > Montreal (0.04)
- Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)
- Asia (0.04)
The First Voice Timbre Attribute Detection Challenge
Chen, Liping, He, Jinghao, Sheng, Zhengyan, Lee, Kong Aik, Ling, Zhen-Hua
The first voice timbre attribute detection challenge is featured in a special session at NCMMSC 2025. It focuses on the explainability of voice timbre and compares the intensity of two speech utterances in a specified timbre descriptor dimension. The evaluation was conducted on the VCTK-RVA dataset. Participants developed their systems and submitted their outputs to the organizer, who evaluated the performance and sent feedback to them. Six teams submitted their outputs, with five providing descriptions of their methodologies.
- Asia > China > Hong Kong (0.06)
- North America > United States (0.05)
- Asia > China > Anhui Province > Hefei (0.05)
MahaTTS: A Unified Framework for Multilingual Text-to-Speech Synthesis
Singh, Jaskaran, Chowdhury, Amartya Roy, Prabhakar, Raghav, W, Varshul C.
Current Text-to-Speech models pose a multilingual challenge, where most of the models traditionally focus on English and European languages, thereby hurting the potential to provide access to information to many more people. To address this gap, we introduce MahaTTS-v2 a Multilingual Multi-speaker Text-To-Speech (TTS) system that has excellent multilingual expressive capabilities in Indic languages. The model has been trained on around 20K hours of data specifically focused on Indian languages. Our approach leverages Wav2Vec2.0 tokens for semantic extraction, and a Language Model (LM) for text-to-semantic modeling. Additionally, we have used a Conditional Flow Model (CFM) for semantics to melspectogram generation. The experimental results indicate the effectiveness of the proposed approach over other frameworks. Our code is available at https://github.com/dubverse-ai/MahaTTSv2
A2TTS: TTS for Low Resource Indian Languages
Bhadoriya, Ayush Singh, Shinde, Abhishek Nikunj, Pandey, Isha, Ramakrishnan, Ganesh
We present a speaker conditioned text-to-speech (TTS) system aimed at addressing challenges in generating speech for unseen speakers and supporting diverse Indian languages. Our method leverages a diffusion-based TTS architecture, where a speaker encoder extracts embeddings from short reference audio samples to condition the DDPM decoder for multispeaker generation. To further enhance prosody and naturalness, we employ a cross-attention based duration prediction mechanism that utilizes reference audio, enabling more accurate and speaker consistent timing. This results in speech that closely resembles the target speaker while improving duration modeling and overall expressiveness. Additionally, to improve zero-shot generation, we employed classifier free guidance, allowing the system to generate speech more near speech for unknown speakers. Using this approach, we trained language-specific speaker-conditioned models. Using the IndicSUPERB dataset for multiple Indian languages such as Bengali, Gujarati, Hindi, Marathi, Malayalam, Punjabi and Tamil.
Introducing voice timbre attribute detection
He, Jinghao, Sheng, Zhengyan, Chen, Liping, Lee, Kong Aik, Ling, Zhen-Hua
This paper focuses on explaining the timbre conveyed by speech signals and introduces a task termed voice timbre attribute detection (vTAD). In this task, voice timbre is explained with a set of sensory attributes describing its human perception. A pair of speech utterances is processed, and their intensity is compared in a designated timbre descriptor. Moreover, a framework is proposed, which is built upon the speaker embeddings extracted from the speech utterances. The investigation is conducted on the VCTK-RVA dataset. Experimental examinations on the ECAPA-TDNN and FACodec speaker encoders demonstrated that: 1) the ECAPA-TDNN speaker encoder was more capable in the seen scenario, where the testing speakers were included in the training set; 2) the FACodec speaker encoder was superior in the unseen scenario, where the testing speakers were not part of the training, indicating enhanced generalization capability. The VCTK-RVA dataset and open-source code are available on the website https://github.com/vTAD2025-Challenge/vTAD.